MEM T380
HW2a
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing
# set seaborn's default settings
sns.set()
Import Data

We import the data into separate dataframes, one per subset, for later use, and also concatenate them into a single dataframe for the current exploration.
excel_file = 'weld_defect_dataset.xlsx'
subsets = []
for i in range(1, 6):
    subset = pd.read_excel(excel_file, sheet_name='subset' + str(i))
    subset = subset.rename(columns={'Type': 'type',
                                    'W': 'w',
                                    'Ar': 'ar',
                                    'Sp': 'sp',
                                    'Re': 're',
                                    'Rr': 'rr',
                                    'Sk': 'sk',
                                    'Ku': 'ku',
                                    'Hc': 'hc',
                                    'Rc': 'rc',
                                    'Sc ': 'sc',
                                    'Kc ': 'kc'})  # note: the trailing spaces after 'Sc' and 'Kc' are naming errors in the Excel file, corrected here for ease of use later
    subsets.append(subset)
subsetsall = pd.concat(subsets, ignore_index=True)
subsetsall
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.006897 | 0.5748 | 0.838397 | 0.998562 | 0.091802 | 0.908459 | 0.003151 | 0.111302 | 0.256742 | 0.389952 |
| 1 | PO | 0.010029 | 0.003448 | 0.4112 | 0.838397 | 0.649317 | 0.039172 | 0.476520 | 0.002817 | 0.121299 | 0.332611 | 0.443785 |
| 2 | PO | 0.007163 | 0.003448 | 0.4400 | 1.007173 | 0.754309 | 0.048079 | 0.766430 | 0.002621 | 0.127759 | 0.323068 | 0.444515 |
| 3 | PO | 0.028653 | 0.003448 | 0.3124 | 0.534599 | 0.061617 | 0.244800 | 0.789110 | 0.010007 | 0.092632 | 0.220312 | 0.339685 |
| 4 | PO | 0.018625 | 0.003448 | 0.4024 | 0.557089 | 0.037346 | 0.578774 | 0.630554 | 0.006757 | 0.073914 | 0.270908 | 0.273045 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 215 | CR | 0.277937 | 0.949262 | 1.0268 | 0.102869 | 0.723013 | 0.025025 | 0.468658 | 0.101296 | 0.757683 | 0.231426 | 0.516244 |
| 216 | CR | 0.148997 | 0.720690 | 0.8172 | 0.055527 | 0.509504 | 0.135456 | 0.551284 | 0.010890 | 0.262126 | 0.410800 | 0.530843 |
| 217 | CR | 0.320917 | 0.846359 | 0.7100 | 0.106793 | 0.407912 | 0.027538 | 0.488077 | 0.191586 | 0.757547 | 0.158517 | 0.559012 |
| 218 | CR | 0.322350 | 0.578386 | 0.6420 | 0.143629 | 0.384393 | 0.039732 | 0.492730 | 0.154902 | 0.640716 | 0.218541 | 0.567931 |
| 219 | CR | 0.372493 | 0.799686 | 0.8580 | 0.167046 | 0.235256 | 0.075930 | 0.558360 | 0.268964 | 0.637409 | 0.164191 | 0.586349 |
220 rows × 12 columns
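As an aside, the trailing-space fix noted in the rename above can also be handled generically by stripping whitespace from (and lowercasing) every column name at once; a minimal sketch on a hypothetical frame with the same naming problem:

```python
import pandas as pd

# Hypothetical frame reproducing the trailing-space column names from the Excel file
df = pd.DataFrame({'Type': ['PO'], 'Sc ': [0.25], 'Kc ': [0.39]})
df.columns = df.columns.str.strip().str.lower()  # strip stray spaces, normalize case
print(list(df.columns))  # ['type', 'sc', 'kc']
```

This avoids spelling out the full rename dictionary, at the cost of losing control over individual name mappings.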
For Reference:
We now have individual dataframes for each subset of data, but we also have one large dataframe for current use / data exploration where applicable:
print(subsets[0].shape)
print(subsets[4].shape)
print(subsetsall.shape)
(44, 12) (44, 12) (220, 12)
With .info() we see that the data is in a very clean format, with no missing values (220 non-null), and all data types are correct.
subsetsall.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 220 entries, 0 to 219 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 type 220 non-null object 1 w 220 non-null float64 2 ar 220 non-null float64 3 sp 220 non-null float64 4 re 220 non-null float64 5 rr 220 non-null float64 6 sk 220 non-null float64 7 ku 220 non-null float64 8 hc 220 non-null float64 9 rc 220 non-null float64 10 sc 220 non-null float64 11 kc 220 non-null float64 dtypes: float64(11), object(1) memory usage: 20.8+ KB
Or we can do so with .dtypes and .isna().sum().sum():
subsetsall.dtypes
type object w float64 ar float64 sp float64 re float64 rr float64 sk float64 ku float64 hc float64 rc float64 sc float64 kc float64 dtype: object
subsetsall.isna().sum().sum()
0
We can also use .describe() to get a quick overview of the data:
subsets[0].describe()
| w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 |
| mean | 0.193312 | 0.219518 | 0.596164 | 0.408728 | 0.244100 | 0.190435 | 0.618382 | 0.114831 | 0.227141 | 0.222136 | 0.468713 |
| std | 0.220107 | 0.260694 | 0.217955 | 0.274365 | 0.223650 | 0.171583 | 0.145415 | 0.157291 | 0.209002 | 0.113051 | 0.138047 |
| min | 0.007163 | 0.003448 | 0.192400 | 0.004051 | 0.001971 | 0.011205 | 0.269329 | 0.001358 | 0.032748 | 0.002616 | 0.100178 |
| 25% | 0.041189 | 0.014874 | 0.415600 | 0.145802 | 0.087699 | 0.080167 | 0.536612 | 0.017436 | 0.110421 | 0.142772 | 0.412662 |
| 50% | 0.088109 | 0.062179 | 0.551800 | 0.492258 | 0.169426 | 0.115660 | 0.585224 | 0.038245 | 0.155732 | 0.215720 | 0.468566 |
| 75% | 0.280444 | 0.393534 | 0.830500 | 0.599842 | 0.330571 | 0.243957 | 0.672636 | 0.131052 | 0.218242 | 0.292278 | 0.545424 |
| max | 1.000000 | 0.826724 | 0.928000 | 1.007173 | 0.998562 | 0.681613 | 1.113649 | 0.617477 | 1.001281 | 0.571364 | 0.911416 |
subsets[4].describe()
| w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 |
| mean | 0.175860 | 0.220914 | 0.623318 | 0.391490 | 0.243808 | 0.190940 | 0.582134 | 0.080576 | 0.219547 | 0.225331 | 0.502365 |
| std | 0.188479 | 0.276367 | 0.209529 | 0.277337 | 0.203288 | 0.221725 | 0.163183 | 0.095121 | 0.180705 | 0.152141 | 0.164504 |
| min | 0.007163 | 0.003448 | 0.157200 | 0.002152 | 0.000200 | 0.001608 | 0.168895 | 0.002135 | 0.013219 | 0.001474 | 0.155346 |
| 25% | 0.038682 | 0.009670 | 0.438300 | 0.135696 | 0.093246 | 0.074033 | 0.495114 | 0.015487 | 0.099992 | 0.131640 | 0.416361 |
| 50% | 0.090974 | 0.059112 | 0.638000 | 0.437933 | 0.199206 | 0.132192 | 0.563511 | 0.035016 | 0.178824 | 0.213700 | 0.498491 |
| 75% | 0.263968 | 0.362931 | 0.783000 | 0.606814 | 0.334925 | 0.158128 | 0.643986 | 0.125883 | 0.265546 | 0.309116 | 0.582118 |
| max | 0.816619 | 0.949262 | 1.026800 | 1.007173 | 0.817756 | 1.002376 | 1.128828 | 0.378949 | 0.757683 | 0.729507 | 0.990413 |
subsetsall.describe()
| w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 220.000000 |
| mean | 0.175905 | 0.207577 | 0.599259 | 0.392519 | 0.249057 | 0.167519 | 0.604765 | 0.092316 | 0.216053 | 0.240740 | 0.478316 |
| std | 0.192450 | 0.256669 | 0.216377 | 0.265337 | 0.208797 | 0.164088 | 0.150983 | 0.138605 | 0.173100 | 0.143031 | 0.150625 |
| min | 0.001433 | 0.003448 | 0.025200 | 0.000591 | 0.000118 | 0.001608 | 0.168895 | 0.000013 | 0.004129 | 0.001474 | 0.028573 |
| 25% | 0.035817 | 0.011860 | 0.415600 | 0.131772 | 0.086298 | 0.065242 | 0.519888 | 0.012539 | 0.107777 | 0.145720 | 0.372002 |
| 50% | 0.078080 | 0.062179 | 0.586600 | 0.412764 | 0.213024 | 0.113033 | 0.571244 | 0.033709 | 0.158049 | 0.216357 | 0.481886 |
| 75% | 0.277937 | 0.362931 | 0.826200 | 0.604219 | 0.339437 | 0.198047 | 0.670390 | 0.111988 | 0.254074 | 0.317875 | 0.572017 |
| max | 1.000000 | 1.037931 | 1.026800 | 1.007173 | 1.003975 | 1.002376 | 1.202949 | 1.049198 | 1.001281 | 1.000876 | 1.025173 |
From the three .describe() outputs, all subsets appear to contain similar data, so for further data exploration we will use the full concatenated dataset.
# original subsets are 44 rows each
# original subsetsall is 220 rows (44*5)
for i in range(5):
    subsets[i].drop_duplicates(inplace=True)
    print(subsets[i].shape)
print(subsetsall.shape)
subsetsall_temp = subsetsall.copy()
subsetsall_temp.drop_duplicates(inplace=True)
print(subsetsall_temp.shape)
(44, 12) (44, 12) (44, 12) (44, 12) (44, 12) (220, 12) (219, 12)
We can see that no individual subset has any duplicated entries, but the concatenated dataframe does. Given the 4-6 decimal precision across 11 columns, this is unlikely to be a genuinely repeated measurement and more likely a duplicate entry. We will keep the duplicate removed from the concatenated dataframe and also remove it from one of the subsets.
duplicated_row = subsetsall[subsetsall.duplicated()]
duplicated_row
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | PO | 0.015759 | 0.003448 | 0.4552 | 0.627426 | 0.056636 | 0.116363 | 0.678178 | 0.004218 | 0.107777 | 0.266969 | 0.444385 |
# Assuming the DataFrames have already been read into the 'subsets' list
duplicated_row = subsetsall[subsetsall.duplicated()].iloc[0]

def is_same_row(row, target_row):
    return row.equals(target_row)

for i in range(5):
    # Check whether each row in the DataFrame matches the duplicated row
    same_rows = subsets[i].apply(is_same_row, axis=1, args=(duplicated_row,))
    if same_rows.any():
        print("Subset " + str(i+1) + ":")
        print("Rows that are the same as the example row:")
        print(subsets[i][same_rows])
Subset 1:
Rows that are the same as the example row:
type w ar sp re rr sk ku \
6 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178
hc rc sc kc
6 0.004218 0.107777 0.266969 0.444385
Subset 2:
Rows that are the same as the example row:
type w ar sp re rr sk ku \
6 PO 0.015759 0.003448 0.4552 0.627426 0.056636 0.116363 0.678178
hc rc sc kc
6 0.004218 0.107777 0.266969 0.444385
It can be seen above that subset1 and subset2 share an identical row at index 6. For the reasons mentioned earlier, we will remove this row from one of the subsets; we choose subset2 since it appears later in the data than subset1.
subsets[1].head(10)
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.003448 | 0.6420 | 0.416456 | 1.003975 | 0.116834 | 0.961553 | 0.003549 | 0.125551 | 0.345271 | 0.407876 |
| 1 | PO | 0.012894 | 0.003448 | 0.3784 | 0.235612 | 0.599335 | 0.100720 | 0.661515 | 0.003367 | 0.100663 | 0.161510 | 0.362336 |
| 2 | PO | 0.012894 | 0.003448 | 0.3784 | 0.235612 | 0.472149 | 0.015586 | 0.373776 | 0.003766 | 0.070719 | 0.240516 | 0.371707 |
| 3 | PO | 0.010029 | 0.003448 | 0.2152 | 0.979030 | 0.421204 | 0.043400 | 0.783220 | 0.003151 | 0.004129 | 0.017908 | 0.028573 |
| 4 | PO | 0.020057 | 0.003448 | 0.5600 | 0.527511 | 0.050374 | 0.211741 | 0.725096 | 0.008584 | 0.084605 | 0.281740 | 0.329137 |
| 5 | PO | 0.011461 | 0.003448 | 0.2996 | 0.773502 | 0.310067 | 0.058798 | 0.536005 | 0.001847 | 0.116556 | 0.199172 | 0.441565 |
| 6 | PO | 0.015759 | 0.003448 | 0.4552 | 0.627426 | 0.056636 | 0.116363 | 0.678178 | 0.004218 | 0.107777 | 0.266969 | 0.444385 |
| 7 | PO | 0.035817 | 0.003448 | 0.4156 | 0.669620 | 0.007348 | 0.266460 | 0.738220 | 0.016878 | 0.039208 | 0.125001 | 0.242095 |
| 8 | PO | 0.011461 | 0.003448 | 0.3000 | 0.773502 | 0.121862 | 0.145167 | 0.534106 | 0.001259 | 0.123437 | 0.205165 | 0.502887 |
| 9 | PO | 0.053009 | 0.003448 | 0.4104 | 0.476751 | 0.081149 | 0.345283 | 1.202949 | 0.034153 | 0.087581 | 0.567489 | 0.475278 |
subsets[1].drop(6, inplace=True)
print(subsets[1].shape)
(43, 12)
subsets[1].head(10)
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.003448 | 0.6420 | 0.416456 | 1.003975 | 0.116834 | 0.961553 | 0.003549 | 0.125551 | 0.345271 | 0.407876 |
| 1 | PO | 0.012894 | 0.003448 | 0.3784 | 0.235612 | 0.599335 | 0.100720 | 0.661515 | 0.003367 | 0.100663 | 0.161510 | 0.362336 |
| 2 | PO | 0.012894 | 0.003448 | 0.3784 | 0.235612 | 0.472149 | 0.015586 | 0.373776 | 0.003766 | 0.070719 | 0.240516 | 0.371707 |
| 3 | PO | 0.010029 | 0.003448 | 0.2152 | 0.979030 | 0.421204 | 0.043400 | 0.783220 | 0.003151 | 0.004129 | 0.017908 | 0.028573 |
| 4 | PO | 0.020057 | 0.003448 | 0.5600 | 0.527511 | 0.050374 | 0.211741 | 0.725096 | 0.008584 | 0.084605 | 0.281740 | 0.329137 |
| 5 | PO | 0.011461 | 0.003448 | 0.2996 | 0.773502 | 0.310067 | 0.058798 | 0.536005 | 0.001847 | 0.116556 | 0.199172 | 0.441565 |
| 7 | PO | 0.035817 | 0.003448 | 0.4156 | 0.669620 | 0.007348 | 0.266460 | 0.738220 | 0.016878 | 0.039208 | 0.125001 | 0.242095 |
| 8 | PO | 0.011461 | 0.003448 | 0.3000 | 0.773502 | 0.121862 | 0.145167 | 0.534106 | 0.001259 | 0.123437 | 0.205165 | 0.502887 |
| 9 | PO | 0.053009 | 0.003448 | 0.4104 | 0.476751 | 0.081149 | 0.345283 | 1.202949 | 0.034153 | 0.087581 | 0.567489 | 0.475278 |
| 10 | SL | 0.018625 | 0.020690 | 0.6496 | 0.744641 | 0.304496 | 0.052827 | 0.370475 | 0.007743 | 0.187048 | 0.500302 | 0.330965 |
Reset the index in the two dataframes that had rows removed:
subsetsall = subsetsall_temp.copy()
subsetsall.reset_index(drop=True, inplace=True) # use the drop=True to avoid the old index being added as a column, and having to drop it later
subsets[1].reset_index(drop=True, inplace=True)
subsets[1].head(10)
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.003448 | 0.6420 | 0.416456 | 1.003975 | 0.116834 | 0.961553 | 0.003549 | 0.125551 | 0.345271 | 0.407876 |
| 1 | PO | 0.012894 | 0.003448 | 0.3784 | 0.235612 | 0.599335 | 0.100720 | 0.661515 | 0.003367 | 0.100663 | 0.161510 | 0.362336 |
| 2 | PO | 0.012894 | 0.003448 | 0.3784 | 0.235612 | 0.472149 | 0.015586 | 0.373776 | 0.003766 | 0.070719 | 0.240516 | 0.371707 |
| 3 | PO | 0.010029 | 0.003448 | 0.2152 | 0.979030 | 0.421204 | 0.043400 | 0.783220 | 0.003151 | 0.004129 | 0.017908 | 0.028573 |
| 4 | PO | 0.020057 | 0.003448 | 0.5600 | 0.527511 | 0.050374 | 0.211741 | 0.725096 | 0.008584 | 0.084605 | 0.281740 | 0.329137 |
| 5 | PO | 0.011461 | 0.003448 | 0.2996 | 0.773502 | 0.310067 | 0.058798 | 0.536005 | 0.001847 | 0.116556 | 0.199172 | 0.441565 |
| 6 | PO | 0.035817 | 0.003448 | 0.4156 | 0.669620 | 0.007348 | 0.266460 | 0.738220 | 0.016878 | 0.039208 | 0.125001 | 0.242095 |
| 7 | PO | 0.011461 | 0.003448 | 0.3000 | 0.773502 | 0.121862 | 0.145167 | 0.534106 | 0.001259 | 0.123437 | 0.205165 | 0.502887 |
| 8 | PO | 0.053009 | 0.003448 | 0.4104 | 0.476751 | 0.081149 | 0.345283 | 1.202949 | 0.034153 | 0.087581 | 0.567489 | 0.475278 |
| 9 | SL | 0.018625 | 0.020690 | 0.6496 | 0.744641 | 0.304496 | 0.052827 | 0.370475 | 0.007743 | 0.187048 | 0.500302 | 0.330965 |
Use .describe() and .info() on our cleaned data (concatenated only is okay here):
subsetsall.describe()
| w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 |
| mean | 0.176636 | 0.208510 | 0.599917 | 0.391446 | 0.249936 | 0.167753 | 0.604430 | 0.092718 | 0.216547 | 0.240620 | 0.478471 |
| std | 0.192585 | 0.256884 | 0.216652 | 0.265466 | 0.208867 | 0.164427 | 0.151247 | 0.138793 | 0.173341 | 0.143348 | 0.150952 |
| min | 0.001433 | 0.003448 | 0.025200 | 0.000591 | 0.000118 | 0.001608 | 0.168895 | 0.000013 | 0.004129 | 0.001474 | 0.028573 |
| 25% | 0.037250 | 0.012357 | 0.415600 | 0.131266 | 0.087341 | 0.064804 | 0.519690 | 0.012782 | 0.108342 | 0.145656 | 0.371903 |
| 50% | 0.078797 | 0.062834 | 0.587200 | 0.409072 | 0.214041 | 0.112573 | 0.570744 | 0.034153 | 0.158479 | 0.215147 | 0.482290 |
| 75% | 0.277937 | 0.363793 | 0.826800 | 0.602742 | 0.341661 | 0.199904 | 0.669316 | 0.113712 | 0.254108 | 0.318871 | 0.572080 |
| max | 1.000000 | 1.037931 | 1.026800 | 1.007173 | 1.003975 | 1.002376 | 1.202949 | 1.049198 | 1.001281 | 1.000876 | 1.025173 |
subsetsall.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 219 entries, 0 to 218 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 type 219 non-null object 1 w 219 non-null float64 2 ar 219 non-null float64 3 sp 219 non-null float64 4 re 219 non-null float64 5 rr 219 non-null float64 6 sk 219 non-null float64 7 ku 219 non-null float64 8 hc 219 non-null float64 9 rc 219 non-null float64 10 sc 219 non-null float64 11 kc 219 non-null float64 dtypes: float64(11), object(1) memory usage: 20.7+ KB
We can also check whether any numerically encoded column should actually be treated as categorical:
subsetsall.nunique(axis=0)
type 5 w 138 ar 164 sp 181 re 175 rr 219 sk 219 ku 219 hc 218 rc 219 sc 219 kc 219 dtype: int64
The only column with a low enough number of unique values to be considered categorical is the type column, which is already categorical, so we leave it as is.
We can create a list of numerical columns; a categorical list is unnecessary for this dataset since type is the only categorical column.
nums = list(subsetsall.select_dtypes(exclude=['object']).columns)
nums
['w', 'ar', 'sp', 're', 'rr', 'sk', 'ku', 'hc', 'rc', 'sc', 'kc']
Visualize Data

PairPlot:
sns.pairplot(subsetsall, vars=nums, hue='type')
<seaborn.axisgrid.PairGrid at 0x21daec13070>
It can be seen that type PO spikes heavily in some panels, such as the hc and ar density plots on the diagonal, and is pronounced in other panels as well. We will come back to this later.
Currently, we can identify some features that are useful for finding trends. From initial visual inspection, re and rc plotted against any other feature seem to split the data into its categorical types reasonably well. Specifically, re vs rc looks like the best individual pairing.
However, others are better at splitting only certain types.
For example, kc vs w, ar, rr, and hc split PO and SL quite well, while ar vs rc splits LP and CR well.
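To back the visual reading with numbers, per-type summary statistics of the promising features can be computed with a groupby; a minimal sketch on hypothetical miniature data (on the real data this would be `subsetsall.groupby('type')[['re', 'rc']].mean()`):

```python
import pandas as pd

# Hypothetical miniature of subsetsall: quantify per-type separation of re and rc
demo = pd.DataFrame({
    'type': ['PO', 'PO', 'PO', 'CR', 'CR', 'CR'],
    're':   [0.80, 0.90, 0.85, 0.10, 0.12, 0.09],
    'rc':   [0.11, 0.12, 0.10, 0.70, 0.75, 0.64],
})
per_type = demo.groupby('type')[['re', 'rc']].mean()  # one row per type
print(per_type)
```

Widely separated per-type means (relative to the spreads) confirm that a feature pair splits the classes well.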
A heatmap of the correlation between the numerical data can be made:
sns.heatmap(subsetsall[nums].corr(), annot=True)
<Axes: >
re by far has the strongest correlations with other features, namely w, ar, and sp; rc also has some strong ones.
This confirms our visual inference.
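The heatmap reading can also be extracted programmatically by ranking each feature's strongest off-diagonal absolute correlation; a sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data: find each feature's strongest |correlation| with another feature
demo = pd.DataFrame({'w':  [0.1, 0.3, 0.5, 0.9],
                     're': [0.9, 0.6, 0.4, 0.1],
                     'sk': [0.2, 0.8, 0.1, 0.5]})
corr = demo.corr().abs()
for c in corr.columns:
    corr.loc[c, c] = 0.0  # zero the diagonal so a feature cannot match itself
strongest = corr.max().sort_values(ascending=False)
print(strongest)
```

On the real data the same pattern applied to `subsetsall[nums].corr()` would rank re and rc near the top.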
Normalization will likely not help here, as the data is already nearly normalized (max near 1 and min near 0):
subsetsall.describe()
| w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 | 219.000000 |
| mean | 0.176636 | 0.208510 | 0.599917 | 0.391446 | 0.249936 | 0.167753 | 0.604430 | 0.092718 | 0.216547 | 0.240620 | 0.478471 |
| std | 0.192585 | 0.256884 | 0.216652 | 0.265466 | 0.208867 | 0.164427 | 0.151247 | 0.138793 | 0.173341 | 0.143348 | 0.150952 |
| min | 0.001433 | 0.003448 | 0.025200 | 0.000591 | 0.000118 | 0.001608 | 0.168895 | 0.000013 | 0.004129 | 0.001474 | 0.028573 |
| 25% | 0.037250 | 0.012357 | 0.415600 | 0.131266 | 0.087341 | 0.064804 | 0.519690 | 0.012782 | 0.108342 | 0.145656 | 0.371903 |
| 50% | 0.078797 | 0.062834 | 0.587200 | 0.409072 | 0.214041 | 0.112573 | 0.570744 | 0.034153 | 0.158479 | 0.215147 | 0.482290 |
| 75% | 0.277937 | 0.363793 | 0.826800 | 0.602742 | 0.341661 | 0.199904 | 0.669316 | 0.113712 | 0.254108 | 0.318871 | 0.572080 |
| max | 1.000000 | 1.037931 | 1.026800 | 1.007173 | 1.003975 | 1.002376 | 1.202949 | 1.049198 | 1.001281 | 1.000876 | 1.025173 |
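For completeness, here is what min-max normalization would look like; a sketch on hypothetical values, since for subsetsall the effect would be minimal given the ranges above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical columns whose ranges slightly exceed [0, 1], like ku or kc above
X = np.array([[0.001, 0.17],
              [0.500, 0.60],
              [1.000, 1.20]])
X_norm = MinMaxScaler().fit_transform(X)  # rescale each column to exactly [0, 1]
print(X_norm.min(axis=0), X_norm.max(axis=0))
```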
We can try standardization to see if it helps:
subsetsall_std = subsetsall.copy()
subsetsall_std.head()
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.006897 | 0.5748 | 0.838397 | 0.998562 | 0.091802 | 0.908459 | 0.003151 | 0.111302 | 0.256742 | 0.389952 |
| 1 | PO | 0.010029 | 0.003448 | 0.4112 | 0.838397 | 0.649317 | 0.039172 | 0.476520 | 0.002817 | 0.121299 | 0.332611 | 0.443785 |
| 2 | PO | 0.007163 | 0.003448 | 0.4400 | 1.007173 | 0.754309 | 0.048079 | 0.766430 | 0.002621 | 0.127759 | 0.323068 | 0.444515 |
| 3 | PO | 0.028653 | 0.003448 | 0.3124 | 0.534599 | 0.061617 | 0.244800 | 0.789110 | 0.010007 | 0.092632 | 0.220312 | 0.339685 |
| 4 | PO | 0.018625 | 0.003448 | 0.4024 | 0.557089 | 0.037346 | 0.578774 | 0.630554 | 0.006757 | 0.073914 | 0.270908 | 0.273045 |
suffix = '_std'  # avoid naming this 'sc', which collides with the 'sc' column name
nums_std = [s + suffix for s in nums]
print(nums_std)
['w_std', 'ar_std', 'sp_std', 're_std', 'rr_std', 'sk_std', 'ku_std', 'hc_std', 'rc_std', 'sc_std', 'kc_std']
std_Scaler = preprocessing.StandardScaler()
std_Scaler.fit(subsetsall_std[nums])
subsetsall_std[nums_std] = std_Scaler.transform(subsetsall_std[nums])
subsetsall_std.drop(nums, axis=1, inplace=True)
subsetsall_std.head()
| type | w_std | ar_std | sp_std | re_std | rr_std | sk_std | ku_std | hc_std | rc_std | sc_std | kc_std | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | -0.874553 | -0.786637 | -0.116198 | 1.687500 | 3.592437 | -0.462970 | 2.014749 | -0.646807 | -0.608549 | 0.112723 | -0.587744 |
| 1 | PO | -0.867095 | -0.800094 | -0.873054 | 1.687500 | 1.916513 | -0.783785 | -0.847640 | -0.649219 | -0.550744 | 0.643202 | -0.230305 |
| 2 | PO | -0.882011 | -0.800094 | -0.739818 | 2.324728 | 2.420339 | -0.729491 | 1.073546 | -0.650634 | -0.513391 | 0.576477 | -0.225458 |
| 3 | PO | -0.770168 | -0.800094 | -1.330129 | 0.540485 | -0.903687 | 0.469655 | 1.223842 | -0.597296 | -0.716502 | -0.141997 | -0.921506 |
| 4 | PO | -0.822358 | -0.800094 | -0.913765 | 0.625398 | -1.020156 | 2.505449 | 0.173118 | -0.620766 | -0.824734 | 0.211772 | -1.363982 |
subsetsall_std.describe()
| w_std | ar_std | sp_std | re_std | rr_std | sk_std | ku_std | hc_std | rc_std | sc_std | kc_std | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 | 2.190000e+02 |
| mean | -8.111218e-18 | -3.244487e-17 | 1.662800e-16 | -9.733462e-17 | -6.083414e-18 | -7.502877e-17 | 1.022014e-15 | 3.244487e-17 | -4.461170e-17 | 9.125121e-17 | -4.866731e-17 |
| std | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 | 1.002291e+00 |
| min | -9.118320e-01 | -8.000944e-01 | -2.658791e+00 | -1.475707e+00 | -1.198802e+00 | -1.012763e+00 | -2.886221e+00 | -6.694647e-01 | -1.228245e+00 | -1.672119e+00 | -2.987224e+00 |
| 25% | -7.254280e-01 | -7.653339e-01 | -8.526988e-01 | -9.823347e-01 | -7.802472e-01 | -6.275409e-01 | -5.615597e-01 | -5.772531e-01 | -6.256668e-01 | -6.639940e-01 | -7.075824e-01 |
| 50% | -5.091975e-01 | -5.683864e-01 | -5.883228e-02 | 6.654728e-02 | -1.722487e-01 | -3.363570e-01 | -2.232333e-01 | -4.229268e-01 | -3.357616e-01 | -1.781106e-01 | 2.536041e-02 |
| 75% | 5.272102e-01 | 6.058740e-01 | 1.049620e+00 | 7.977667e-01 | 4.401597e-01 | 1.959837e-01 | 4.299905e-01 | 1.516094e-01 | 2.171876e-01 | 5.471275e-01 | 6.215435e-01 |
| max | 4.285127e+00 | 3.236178e+00 | 1.974872e+00 | 2.324728e+00 | 3.618412e+00 | 5.087585e+00 | 3.966286e+00 | 6.907185e+00 | 4.537491e+00 | 5.315733e+00 | 3.629988e+00 |
sns.pairplot(subsetsall_std, vars=nums_std, hue='type')
<seaborn.axisgrid.PairGrid at 0x21dbe337e20>
The data still looks very similar. We will not use the standardized data and instead stick with the original, nearly normalized data.
Let's come back to the PO spike from our pairplots and check whether it is simply due to an overabundance of PO rows in the data:
sns.countplot(x='type', data=subsetsall)
<Axes: xlabel='type', ylabel='count'>
There is not an overwhelming number of PO rows in the data, so the spike must reflect a genuinely strong concentration of PO values, as shown by the kernel density estimate (KDE) plots on the pairplot diagonal.
We can investigate it further with a boxplot.
sns.boxplot(x='type', y='hc', data=subsetsall)
<Axes: xlabel='type', ylabel='hc'>
Low hc very likely means the weld defect is PO. This is a strong relation we can use later, but we must be careful not to overlook other features, since other types also have some low hc values.
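One way to quantify this relation is to tally the types among rows below a low-hc cutoff; a sketch on hypothetical miniature data, with 0.02 as an illustrative threshold (not a value from the assignment):

```python
import pandas as pd

# Hypothetical miniature: which types fall below an illustrative hc cutoff?
demo = pd.DataFrame({'type': ['PO', 'PO', 'SL', 'PO', 'CR'],
                     'hc':   [0.003, 0.004, 0.015, 0.010, 0.150]})
low_hc = demo.loc[demo['hc'] < 0.02, 'type'].value_counts()
print(low_hc)
```

If PO dominates the low-hc counts but does not own them outright, that matches the boxplot reading: a strong but not exclusive signal.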
More plots of the data can be made to try to spot anything not yet clear if questions arise in later analysis.
Additionals:

One thing we might want to do for later is binarize the type column so we can use it in our models:
types = subsetsall['type'].unique()
types = list(types)
types
['PO', 'SL', 'LP', 'LF', 'CR']
from sklearn.preprocessing import label_binarize
type_num = label_binarize(subsetsall.type, classes=types)
print(type_num)
[[1 0 0 0 0] [1 0 0 0 0] [1 0 0 0 0] ... [0 0 0 0 1] [0 0 0 0 1] [0 0 0 0 1]]
for i in range(len(types)):
    subsetsall[types[i]] = type_num[:, i]
subsetsall
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | PO | SL | LP | LF | CR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.006897 | 0.5748 | 0.838397 | 0.998562 | 0.091802 | 0.908459 | 0.003151 | 0.111302 | 0.256742 | 0.389952 | 1 | 0 | 0 | 0 | 0 |
| 1 | PO | 0.010029 | 0.003448 | 0.4112 | 0.838397 | 0.649317 | 0.039172 | 0.476520 | 0.002817 | 0.121299 | 0.332611 | 0.443785 | 1 | 0 | 0 | 0 | 0 |
| 2 | PO | 0.007163 | 0.003448 | 0.4400 | 1.007173 | 0.754309 | 0.048079 | 0.766430 | 0.002621 | 0.127759 | 0.323068 | 0.444515 | 1 | 0 | 0 | 0 | 0 |
| 3 | PO | 0.028653 | 0.003448 | 0.3124 | 0.534599 | 0.061617 | 0.244800 | 0.789110 | 0.010007 | 0.092632 | 0.220312 | 0.339685 | 1 | 0 | 0 | 0 | 0 |
| 4 | PO | 0.018625 | 0.003448 | 0.4024 | 0.557089 | 0.037346 | 0.578774 | 0.630554 | 0.006757 | 0.073914 | 0.270908 | 0.273045 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 214 | CR | 0.277937 | 0.949262 | 1.0268 | 0.102869 | 0.723013 | 0.025025 | 0.468658 | 0.101296 | 0.757683 | 0.231426 | 0.516244 | 0 | 0 | 0 | 0 | 1 |
| 215 | CR | 0.148997 | 0.720690 | 0.8172 | 0.055527 | 0.509504 | 0.135456 | 0.551284 | 0.010890 | 0.262126 | 0.410800 | 0.530843 | 0 | 0 | 0 | 0 | 1 |
| 216 | CR | 0.320917 | 0.846359 | 0.7100 | 0.106793 | 0.407912 | 0.027538 | 0.488077 | 0.191586 | 0.757547 | 0.158517 | 0.559012 | 0 | 0 | 0 | 0 | 1 |
| 217 | CR | 0.322350 | 0.578386 | 0.6420 | 0.143629 | 0.384393 | 0.039732 | 0.492730 | 0.154902 | 0.640716 | 0.218541 | 0.567931 | 0 | 0 | 0 | 0 | 1 |
| 218 | CR | 0.372493 | 0.799686 | 0.8580 | 0.167046 | 0.235256 | 0.075930 | 0.558360 | 0.268964 | 0.637409 | 0.164191 | 0.586349 | 0 | 0 | 0 | 0 | 1 |
219 rows × 17 columns
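As an alternative to label_binarize, pandas' get_dummies builds the same one-hot columns in a single step (note that it orders the columns alphabetically rather than by order of appearance); a minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sketch: one-hot encode a type column with pd.get_dummies
demo = pd.DataFrame({'type': ['PO', 'SL', 'CR']})
one_hot = pd.get_dummies(demo['type'], dtype=int)  # columns: CR, PO, SL (alphabetical)
demo = pd.concat([demo, one_hot], axis=1)
print(demo)
```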
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # create label encoder
subsetsall['type_num'] = le.fit_transform(subsetsall['type'])  # encode the type column as integers
subsetsall.head(40)
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | PO | SL | LP | LF | CR | type_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.006897 | 0.5748 | 0.838397 | 0.998562 | 0.091802 | 0.908459 | 0.003151 | 0.111302 | 0.256742 | 0.389952 | 1 | 0 | 0 | 0 | 0 | 3 |
| 1 | PO | 0.010029 | 0.003448 | 0.4112 | 0.838397 | 0.649317 | 0.039172 | 0.476520 | 0.002817 | 0.121299 | 0.332611 | 0.443785 | 1 | 0 | 0 | 0 | 0 | 3 |
| 2 | PO | 0.007163 | 0.003448 | 0.4400 | 1.007173 | 0.754309 | 0.048079 | 0.766430 | 0.002621 | 0.127759 | 0.323068 | 0.444515 | 1 | 0 | 0 | 0 | 0 | 3 |
| 3 | PO | 0.028653 | 0.003448 | 0.3124 | 0.534599 | 0.061617 | 0.244800 | 0.789110 | 0.010007 | 0.092632 | 0.220312 | 0.339685 | 1 | 0 | 0 | 0 | 0 | 3 |
| 4 | PO | 0.018625 | 0.003448 | 0.4024 | 0.557089 | 0.037346 | 0.578774 | 0.630554 | 0.006757 | 0.073914 | 0.270908 | 0.273045 | 1 | 0 | 0 | 0 | 0 | 3 |
| 5 | PO | 0.011461 | 0.003448 | 0.2996 | 0.773502 | 0.133474 | 0.243676 | 0.452340 | 0.001358 | 0.090320 | 0.259598 | 0.482290 | 1 | 0 | 0 | 0 | 0 | 3 |
| 6 | PO | 0.015759 | 0.003448 | 0.4552 | 0.627426 | 0.056636 | 0.116363 | 0.678178 | 0.004218 | 0.107777 | 0.266969 | 0.444385 | 1 | 0 | 0 | 0 | 0 | 3 |
| 7 | PO | 0.027221 | 0.003448 | 0.4156 | 0.557089 | 0.101013 | 0.093192 | 0.939251 | 0.008386 | 0.070805 | 0.002616 | 0.368677 | 1 | 0 | 0 | 0 | 0 | 3 |
| 8 | PO | 0.030086 | 0.003448 | 0.4248 | 0.513840 | 0.001971 | 0.358502 | 0.653443 | 0.014692 | 0.032748 | 0.172884 | 0.287875 | 1 | 0 | 0 | 0 | 0 | 3 |
| 9 | PO | 0.035817 | 0.003448 | 0.4156 | 0.669620 | 0.004285 | 0.681613 | 0.451320 | 0.017031 | 0.038732 | 0.211128 | 0.444937 | 1 | 0 | 0 | 0 | 0 | 3 |
| 10 | SL | 0.050143 | 0.034648 | 0.5228 | 0.516920 | 0.177013 | 0.063348 | 0.494043 | 0.019171 | 0.197270 | 0.301136 | 0.565477 | 0 | 1 | 0 | 0 | 0 | 4 |
| 11 | SL | 0.025788 | 0.015617 | 0.3340 | 0.471477 | 0.276169 | 0.063095 | 0.507786 | 0.005105 | 0.141924 | 0.245195 | 0.542960 | 0 | 1 | 0 | 0 | 0 | 4 |
| 12 | SL | 0.063037 | 0.081610 | 0.4756 | 0.551477 | 0.400952 | 0.085216 | 0.625463 | 0.026221 | 0.226071 | 0.105925 | 0.592413 | 0 | 1 | 0 | 0 | 0 | 4 |
| 13 | SL | 0.067335 | 0.058621 | 0.5132 | 0.611181 | 0.390934 | 0.082072 | 0.670788 | 0.028746 | 0.174749 | 0.118310 | 0.519074 | 0 | 1 | 0 | 0 | 0 | 4 |
| 14 | SL | 0.042980 | 0.021438 | 0.5288 | 0.766076 | 0.346107 | 0.140695 | 0.653494 | 0.017571 | 0.150256 | 0.382587 | 0.420232 | 0 | 1 | 0 | 0 | 0 | 4 |
| 15 | SL | 0.138968 | 0.115517 | 0.5192 | 0.513840 | 0.053111 | 0.232464 | 0.481685 | 0.214936 | 0.302905 | 0.081632 | 0.646642 | 0 | 1 | 0 | 0 | 0 | 4 |
| 16 | SL | 0.147564 | 0.023731 | 0.2656 | 0.884430 | 0.039914 | 0.596128 | 0.654200 | 0.109447 | 0.097271 | 0.268753 | 0.100178 | 0 | 1 | 0 | 0 | 0 | 4 |
| 17 | SL | 0.074499 | 0.012645 | 0.1924 | 0.631139 | 0.099007 | 0.292338 | 0.753337 | 0.041720 | 0.102973 | 0.206860 | 0.436098 | 0 | 1 | 0 | 0 | 0 | 4 |
| 18 | SL | 0.074499 | 0.044562 | 0.4872 | 0.599283 | 0.089800 | 0.320126 | 0.538374 | 0.027256 | 0.125914 | 0.240262 | 0.627607 | 0 | 1 | 0 | 0 | 0 | 4 |
| 19 | SL | 0.065903 | 0.061524 | 0.6596 | 0.513840 | 0.161838 | 0.643830 | 0.269329 | 0.034770 | 0.158479 | 0.210568 | 0.911416 | 0 | 1 | 0 | 0 | 0 | 4 |
| 20 | LP | 0.415473 | 0.431379 | 0.8668 | 0.068861 | 0.125231 | 0.168406 | 0.767691 | 0.088235 | 0.193620 | 0.131252 | 0.242035 | 0 | 0 | 1 | 0 | 0 | 2 |
| 21 | LP | 0.613181 | 0.324466 | 0.7932 | 0.253713 | 0.030618 | 0.114958 | 0.715633 | 0.483561 | 0.106063 | 0.041874 | 0.302604 | 0 | 0 | 1 | 0 | 0 | 2 |
| 22 | LP | 0.253582 | 0.417241 | 0.8664 | 0.110422 | 0.329608 | 0.069216 | 0.537527 | 0.060204 | 0.208140 | 0.010399 | 0.480510 | 0 | 0 | 1 | 0 | 0 | 2 |
| 23 | LP | 0.187679 | 0.359769 | 0.8320 | 0.037932 | 0.318771 | 0.116448 | 1.113649 | 0.017783 | 0.156013 | 0.068120 | 0.511571 | 0 | 0 | 1 | 0 | 0 | 2 |
| 24 | LP | 0.550143 | 0.242717 | 0.8284 | 0.087806 | 0.068943 | 0.479354 | 0.693576 | 0.411346 | 0.105377 | 0.289325 | 0.250541 | 0 | 0 | 1 | 0 | 0 | 2 |
| 25 | LP | 1.000000 | 0.642338 | 0.8376 | 0.112152 | 0.134046 | 0.106603 | 0.670257 | 0.617477 | 0.132390 | 0.144749 | 0.343697 | 0 | 0 | 1 | 0 | 0 | 2 |
| 26 | LP | 0.777937 | 0.493869 | 0.8384 | 0.046878 | 0.144770 | 0.091868 | 0.657572 | 0.266960 | 0.134514 | 0.331065 | 0.354818 | 0 | 0 | 1 | 0 | 0 | 2 |
| 27 | LP | 0.246418 | 0.308045 | 0.8300 | 0.004051 | 0.088274 | 0.344919 | 0.551920 | 0.050698 | 0.145350 | 0.306394 | 0.423902 | 0 | 0 | 1 | 0 | 0 | 2 |
| 28 | LP | 0.339542 | 0.289921 | 0.8372 | 0.102152 | 0.053839 | 0.221241 | 0.570290 | 0.172724 | 0.166593 | 0.162887 | 0.479350 | 0 | 0 | 1 | 0 | 0 | 2 |
| 29 | LP | 0.343840 | 0.391379 | 0.8336 | 0.015190 | 0.085972 | 0.160149 | 0.595178 | 0.117161 | 0.181476 | 0.339861 | 0.552816 | 0 | 0 | 1 | 0 | 0 | 2 |
| 30 | LF | 0.402579 | 0.221838 | 0.6692 | 0.209451 | 0.103607 | 0.246070 | 0.577897 | 0.417545 | 0.247450 | 0.204585 | 0.438135 | 0 | 0 | 0 | 1 | 0 | 1 |
| 31 | LF | 0.063037 | 0.062834 | 0.2300 | 0.304515 | 0.299782 | 0.203618 | 0.640899 | 0.019800 | 0.181293 | 0.189060 | 0.498422 | 0 | 0 | 0 | 1 | 0 | 1 |
| 32 | LF | 0.206304 | 0.111686 | 0.7932 | 0.353038 | 0.130069 | 0.098141 | 0.547505 | 0.227999 | 0.333604 | 0.163009 | 0.631823 | 0 | 0 | 0 | 1 | 0 | 1 |
| 33 | LF | 0.071633 | 0.055172 | 0.5780 | 0.409072 | 0.196221 | 0.068892 | 0.592551 | 0.029966 | 0.215632 | 0.344792 | 0.457783 | 0 | 0 | 0 | 1 | 0 | 1 |
| 34 | LF | 0.075931 | 0.027790 | 0.4628 | 0.571181 | 0.333461 | 0.044079 | 0.570744 | 0.052599 | 0.161889 | 0.131012 | 0.480531 | 0 | 0 | 0 | 1 | 0 | 1 |
| 35 | LF | 0.073066 | 0.022607 | 0.3972 | 0.601519 | 0.224584 | 0.211671 | 0.751020 | 0.061937 | 0.114510 | 0.136843 | 0.454621 | 0 | 0 | 0 | 1 | 0 | 1 |
| 36 | LF | 0.110315 | 0.054652 | 0.3364 | 0.513038 | 0.258901 | 0.038028 | 0.499730 | 0.071062 | 0.155450 | 0.101629 | 0.489576 | 0 | 0 | 0 | 1 | 0 | 1 |
| 37 | CR | 0.100287 | 0.400000 | 0.7424 | 0.152025 | 0.516416 | 0.137537 | 0.555328 | 0.022585 | 0.432063 | 0.414954 | 0.528984 | 0 | 0 | 0 | 0 | 1 | 0 |
| 38 | CR | 0.209169 | 0.547510 | 0.8400 | 0.127131 | 0.405834 | 0.074451 | 0.543629 | 0.034410 | 0.347480 | 0.284818 | 0.439879 | 0 | 0 | 0 | 0 | 1 | 0 |
| 39 | CR | 0.277937 | 0.826724 | 0.8816 | 0.085738 | 0.704132 | 0.011205 | 0.486398 | 0.104069 | 0.731836 | 0.149571 | 0.529300 | 0 | 0 | 0 | 0 | 1 | 0 |
Classification¶As explored in data preprocessing and visualization, we will use the re and rc features to classify the data.
Let's review this individual scatterplot:
sns.scatterplot(data=subsetsall, x='re', y='rc', hue='type')
plt.show()
Upon inspection of this plot, it is best to use the following 3 types as our targets: CR, LP, and SL. (CR and LP were noted earlier in data preprocessing and visualization, but we add SL as well based on what a close-up of this plot shows.)
More inferences can also be made:
It will be easy to draw a decision boundary between LP and SL, but a KNN approach will have a harder time there because the spread along the re axis is larger for SL than for LP.
For the split between CR and LP, the opposite seems true: a KNN approach might produce better results because the CR data points are closely clumped near the fuzzy border while the LP data points are decently clumped slightly farther from it.
Also, before we move on to classification, we will do a quick cleanup/restructure of the data so it contains only the features and targets stated above. (We will also leave the data in its subsets so we can use them as training and testing splits, and we encode the type column for later use.)
encoded_subsets = []
le = preprocessing.LabelEncoder()  # label encoder for the type column
for i in range(5):
    subsets[i] = subsets[i][['type', 're', 'rc']]  # keep only the chosen features + target column
    subsets[i] = subsets[i][subsets[i]['type'].isin(['CR', 'LP', 'SL'])]  # keep only the chosen types
    subsets[i].reset_index(drop=True, inplace=True)  # reset index after filtering
    subsets[i]['type_num'] = le.fit_transform(subsets[i]['type'])  # encode type column as integers
    print(subsets[i].shape)
(27, 4)
(27, 4)
(27, 4)
(27, 4)
(27, 4)
The total of these types in each subset is equivalent, which is convenient for cross validation later.
One of the new subsets:
subsets[4]
| type | re | rc | type_num | |
|---|---|---|---|---|
| 0 | SL | 0.725865 | 0.218229 | 2 |
| 1 | SL | 0.638143 | 0.212740 | 2 |
| 2 | SL | 0.504768 | 0.146195 | 2 |
| 3 | SL | 0.316709 | 0.188856 | 2 |
| 4 | SL | 0.870253 | 0.217319 | 2 |
| 5 | SL | 0.730675 | 0.095278 | 2 |
| 6 | SL | 0.606329 | 0.179129 | 2 |
| 7 | SL | 0.622489 | 0.101563 | 2 |
| 8 | SL | 0.771772 | 0.092916 | 2 |
| 9 | SL | 0.585232 | 0.160975 | 2 |
| 10 | LP | 0.091646 | 0.182427 | 1 |
| 11 | LP | 0.162194 | 0.074074 | 1 |
| 12 | LP | 0.115063 | 0.233013 | 1 |
| 13 | LP | 0.153840 | 0.207985 | 1 |
| 14 | LP | 0.185274 | 0.132899 | 1 |
| 15 | LP | 0.003038 | 0.092287 | 1 |
| 16 | LP | 0.100253 | 0.130390 | 1 |
| 17 | LP | 0.002152 | 0.204965 | 1 |
| 18 | LP | 0.042278 | 0.178519 | 1 |
| 19 | LP | 0.006034 | 0.292241 | 1 |
| 20 | CR | 0.142574 | 0.308130 | 0 |
| 21 | CR | 0.071941 | 0.398472 | 0 |
| 22 | CR | 0.102869 | 0.757683 | 0 |
| 23 | CR | 0.055527 | 0.262126 | 0 |
| 24 | CR | 0.106793 | 0.757547 | 0 |
| 25 | CR | 0.143629 | 0.640716 | 0 |
| 26 | CR | 0.167046 | 0.637409 | 0 |
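The type_num codes in the table above follow scikit-learn's LabelEncoder, which assigns integer codes in sorted (alphabetical) order of the class names; a quick standalone check of that mapping:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(['SL', 'LP', 'CR', 'SL'])
print(list(le.classes_))  # ['CR', 'LP', 'SL'] -- classes are sorted alphabetically
print(list(codes))        # [2, 1, 0, 2] -> CR=0, LP=1, SL=2, matching the table
```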
KNN¶Now we can make our KNN model.
Create training and testing data from the subsets:
X = []
y = []
for i in range(5):
    X.append(subsets[i][['re', 'rc']].values)
    y.append(subsets[i]['type_num'].values)
print(X[4].shape)
print(y[4].shape)
# Training Data - subsets 1-4 (80% of data)
X_train = np.concatenate(X[:4], axis=0)
y_train = np.concatenate(y[:4], axis=0)
print(X_train.shape)
print(y_train.shape)
# Testing Data - subset 5 (20% of data)
X_test = X[4]
y_test = y[4]
print(X_test.shape)
print(y_test.shape)
(27, 2)
(27,)
(108, 2)
(108,)
(27, 2)
(27,)
Jointplot for more visualization before we begin:
sns.jointplot(x='re', y='rc', data=subsets[4], hue='type_num', kind='scatter')
<seaborn.axisgrid.JointGrid at 0x21dc749eda0>
Fit KNN model:
from sklearn.neighbors import KNeighborsClassifier
k = 5
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)
KNeighborsClassifier()
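Under the hood, KNN classifies a query point by majority vote among its k nearest training points. A minimal sketch on synthetic data (not the weld dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# two well-separated synthetic clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

# a query near the first cluster: all 3 nearest neighbors are class 0
print(toy_knn.predict([[0.05, 0.05]]))        # -> [0]
# predict_proba exposes the vote fractions among the k neighbors
print(toy_knn.predict_proba([[0.05, 0.05]]))  # -> [[1. 0.]]
```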
Plot the decision boundary of the model:
#---min and max for the first feature---
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
#---min and max for the second feature---
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
#---step size in the mesh---
x_step = (x_max - x_min) / 100
y_step = (y_max - y_min) / 100
#---make predictions for each of the points in xx,yy---
xx, yy = np.meshgrid(np.arange(x_min, x_max, x_step), np.arange(y_min, y_max, y_step))
Z = knn_model.predict(np.c_[xx.ravel(), yy.ravel()])
#---draw the result using a color plot---
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Accent, alpha=0.8)
#---plot the training points---
colors = ['red', 'green', 'blue']
types = ['CR', 'LP', 'SL']
for color, i, target in zip(colors, [0, 1, 2], types):
    plt.scatter(X_train[y_train==i, 0], X_train[y_train==i, 1], color=color, label=target)
plt.xlabel('Roughness of Defect Edge (re)')
plt.ylabel('Roughness Contrast (rc)')
plt.title(f'Decision Surface for KNN model with (k={k})')
plt.legend(loc='best', shadow=False, scatterpoints=1)
<matplotlib.legend.Legend at 0x21dc70dbbe0>
It seems like there may be some slight overfitting here, but it is not too bad. We will see how it performs.
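A quick numeric check for overfitting is to compare training accuracy with test accuracy; when the model scores far higher on data it has memorized than on held-out data, it is overfit. A sketch on synthetic stand-in data (not the weld dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 2))
# noisy labels, so a perfect fit cannot generalize perfectly
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=120) > 0).astype(int)
Xtr_d, Xte_d = X_demo[:90], X_demo[90:]
ytr_d, yte_d = y_demo[:90], y_demo[90:]

for k in (1, 5):
    m = KNeighborsClassifier(n_neighbors=k).fit(Xtr_d, ytr_d)
    print(k, m.score(Xtr_d, ytr_d), m.score(Xte_d, yte_d))
# k=1 always scores 1.0 on its own training data -- the telltale sign of overfitting
```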
Predict weld defect types using testing data:
y_pred = knn_model.predict(X_test)
print(y_pred)
[2 2 2 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]
from sklearn.metrics import confusion_matrix
mat_test = confusion_matrix(y_test, y_pred)
print('confusion matrix = \n', mat_test)
confusion matrix =
 [[7 0 0]
 [1 9 0]
 [0 1 9]]
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, square=True, xticklabels=types, yticklabels=types)
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Actual Labels')
Text(17.25, 0.5, 'Actual Labels')
Some precursors to confusion matrix calculations:
# True Positive (TP) = diagonal elements
CR_TP = mat_test[0,0]
LP_TP = mat_test[1,1]
SL_TP = mat_test[2,2]
print(CR_TP, LP_TP, SL_TP)
# False Negative (FN) = sum of row - TP
CR_FN = sum(mat_test[0])-CR_TP
LP_FN = sum(mat_test[1])-LP_TP
SL_FN = sum(mat_test[2])-SL_TP
print(CR_FN, LP_FN, SL_FN)
# False Positive (FP) = sum of column - TP
CR_FP = sum(mat_test[:,0])-CR_TP
LP_FP = sum(mat_test[:,1])-LP_TP
SL_FP = sum(mat_test[:,2])-SL_TP
print(CR_FP, LP_FP, SL_FP)
7 9 9
0 1 1
1 1 0
The True Positive Rate (or Recall or Sensitivity) can be calculated using the formula:
CR_TPR = CR_TP/(CR_TP+CR_FN)
LP_TPR = LP_TP/(LP_TP+LP_FN)
SL_TPR = SL_TP/(SL_TP+SL_FN)
print(CR_TPR, LP_TPR, SL_TPR)
1.0 0.9 0.9
The Positive Predictive Rate (or Precision) can be calculated using the formula:
CR_PPR = CR_TP/(CR_TP+CR_FP)
LP_PPR = LP_TP/(LP_TP+LP_FP)
SL_PPR = SL_TP/(SL_TP+SL_FP)
print(CR_PPR, LP_PPR, SL_PPR)
0.875 0.9 1.0
Final accuracy can be calculated using:
accuracy = (CR_TP + LP_TP + SL_TP)/mat_test.sum()
print(accuracy)
0.9259259259259259
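The per-class quantities above can also be computed vectorially from the confusion matrix; a standalone recomputation using the test matrix printed earlier:

```python
import numpy as np

# the 3x3 test confusion matrix from above (rows = actual, columns = predicted)
mat = np.array([[7, 0, 0],
                [1, 9, 0],
                [0, 1, 9]])

TP = np.diag(mat)            # true positives: the diagonal -> 7, 9, 9
FN = mat.sum(axis=1) - TP    # row sums minus the diagonal  -> 0, 1, 1
FP = mat.sum(axis=0) - TP    # column sums minus the diagonal -> 1, 1, 0

recall = TP / (TP + FN)      # 1.0, 0.9, 0.9
precision = TP / (TP + FP)   # 0.875, 0.9, 1.0
accuracy = TP.sum() / mat.sum()
print(recall, precision, accuracy)  # accuracy -> 0.9259...
```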
Let's verify our manual calculations with the classification_report function:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=types))
precision recall f1-score support
CR 0.88 1.00 0.93 7
LP 0.90 0.90 0.90 10
SL 1.00 0.90 0.95 10
accuracy 0.93 27
macro avg 0.92 0.93 0.93 27
weighted avg 0.93 0.93 0.93 27
An ~93% accurate model is not bad here. CR is predicted with perfect recall, most likely due to the large clump of its data points near its border with LP. The actual LP and SL points each have one misclassification: the stray LP point sits in the CR region, and the stray SL point sits in the LP region. This was expected when first visualizing the data in the single re vs. rc scatterplot earlier; however, the accuracy turned out greater than expected.
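To see which test points were misclassified (rather than just how many), `np.where` on the disagreement mask works; a sketch with small stand-in arrays (not the real predictions above):

```python
import numpy as np

# stand-in arrays shaped like y_test / y_pred
y_true = np.array([2, 2, 1, 1, 0, 0])
y_hat  = np.array([2, 1, 1, 0, 0, 0])

wrong = np.where(y_true != y_hat)[0]
print(wrong)  # indices of the misclassified points -> [1 3]
# (actual, predicted) pairs for inspection
print([(int(a), int(p)) for a, p in zip(y_true[wrong], y_hat[wrong])])
```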
Finding best k value¶Our model works well, but it is always good to check whether we can improve it. We can use a for loop to find the best k value for our model:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
ac_scores = []
k_neighbors = list(range(1,21))
k_neighbors = [k for k in k_neighbors if k % 3 != 0] # remove multiples of 3 to avoid ties
for k in k_neighbors:
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f'k={k}: {f1*100:0.2f}%')
    # print(classification_report(y_test, y_pred, target_names=types))
    ac_score = accuracy_score(y_test, y_pred)
    ac_scores.append(ac_score)
k=1: 92.59%
k=2: 89.06%
k=4: 92.62%
k=5: 92.62%
k=7: 92.69%
k=8: 89.06%
k=10: 92.69%
k=11: 92.69%
k=13: 92.69%
k=14: 92.69%
k=16: 92.69%
k=17: 92.69%
k=19: 92.69%
k=20: 92.69%
It seems that our original (and default) k value of 5 was already very good for our model. However, there is a slightly higher score at k=7 and at k=10 and above. That slight increase in accuracy may not be worth the computational cost in some other cases, but here it does not affect our research usage, so we could change our k to 10 or so if we were to predict further or perform k-fold cross validation. Higher k values also reduce the risk of overfitting.
Misclassification Error¶Looking into and plotting the misclassification error (abbreviated MSE below; not to be confused with mean squared error):
# changing to misclassification error:
MSE = [1 - x for x in ac_scores]
# determining best k:
optimal_k = k_neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
The optimal number of neighbors is 1
# plot misclassification error vs k:
plt.plot(k_neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
We can see that the error is lowest at 1 neighbor. However, a k of 1 would be heavily overfit. Our original k value of 5 can be good, but sitting between two peaks of error, it may not be the best choice. As stated before we even looked at the error plot, we should go for a higher k; perhaps 11 or 13 would be a comfortable choice.
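One way to implement the "prefer a larger k" rule above is to take the largest k among those achieving the minimum error. A sketch with made-up (k, error) values that loosely mimic the plot:

```python
# hypothetical (k, misclassification error) pairs -- not the real results above
errors = {1: 0.074, 2: 0.111, 5: 0.074, 8: 0.111, 11: 0.074, 13: 0.074}

best_err = min(errors.values())
# among the ks tied at the minimum error, prefer the largest
# (smoother decision boundary, less overfitting than k=1)
best_k = max(k for k, e in errors.items() if e == best_err)
print(best_k)  # -> 13
```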
K-Fold Cross Validation Using Entire Dataset¶Instead of just using 2 features and 3 weld defect types, we can use the entire dataset with all features and all 5 weld defect types to see how this model will do. We will do this across multiple folds as well.
subsetsall # our combination of subsets as one large dataset (with encoded type columns and one duplicate row removed, as discussed earlier)
| type | w | ar | sp | re | rr | sk | ku | hc | rc | sc | kc | PO | SL | LP | LF | CR | type_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PO | 0.008596 | 0.006897 | 0.5748 | 0.838397 | 0.998562 | 0.091802 | 0.908459 | 0.003151 | 0.111302 | 0.256742 | 0.389952 | 1 | 0 | 0 | 0 | 0 | 3 |
| 1 | PO | 0.010029 | 0.003448 | 0.4112 | 0.838397 | 0.649317 | 0.039172 | 0.476520 | 0.002817 | 0.121299 | 0.332611 | 0.443785 | 1 | 0 | 0 | 0 | 0 | 3 |
| 2 | PO | 0.007163 | 0.003448 | 0.4400 | 1.007173 | 0.754309 | 0.048079 | 0.766430 | 0.002621 | 0.127759 | 0.323068 | 0.444515 | 1 | 0 | 0 | 0 | 0 | 3 |
| 3 | PO | 0.028653 | 0.003448 | 0.3124 | 0.534599 | 0.061617 | 0.244800 | 0.789110 | 0.010007 | 0.092632 | 0.220312 | 0.339685 | 1 | 0 | 0 | 0 | 0 | 3 |
| 4 | PO | 0.018625 | 0.003448 | 0.4024 | 0.557089 | 0.037346 | 0.578774 | 0.630554 | 0.006757 | 0.073914 | 0.270908 | 0.273045 | 1 | 0 | 0 | 0 | 0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 214 | CR | 0.277937 | 0.949262 | 1.0268 | 0.102869 | 0.723013 | 0.025025 | 0.468658 | 0.101296 | 0.757683 | 0.231426 | 0.516244 | 0 | 0 | 0 | 0 | 1 | 0 |
| 215 | CR | 0.148997 | 0.720690 | 0.8172 | 0.055527 | 0.509504 | 0.135456 | 0.551284 | 0.010890 | 0.262126 | 0.410800 | 0.530843 | 0 | 0 | 0 | 0 | 1 | 0 |
| 216 | CR | 0.320917 | 0.846359 | 0.7100 | 0.106793 | 0.407912 | 0.027538 | 0.488077 | 0.191586 | 0.757547 | 0.158517 | 0.559012 | 0 | 0 | 0 | 0 | 1 | 0 |
| 217 | CR | 0.322350 | 0.578386 | 0.6420 | 0.143629 | 0.384393 | 0.039732 | 0.492730 | 0.154902 | 0.640716 | 0.218541 | 0.567931 | 0 | 0 | 0 | 0 | 1 | 0 |
| 218 | CR | 0.372493 | 0.799686 | 0.8580 | 0.167046 | 0.235256 | 0.075930 | 0.558360 | 0.268964 | 0.637409 | 0.164191 | 0.586349 | 0 | 0 | 0 | 0 | 1 | 0 |
219 rows × 18 columns
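The next cell references `nums`, which was defined earlier in the notebook; based on the columns above and the (219, 11) shape that follows, it is presumably the list of the 11 numeric feature columns. A hedged reconstruction:

```python
# hedged reconstruction -- `nums` as presumably defined earlier in the notebook
nums = ['w', 'ar', 'sp', 're', 'rr', 'sk', 'ku', 'hc', 'rc', 'sc', 'kc']
print(len(nums))  # 11, matching the (219, 11) feature matrix below
```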
X, y = subsetsall[nums].values, subsetsall['type'].values  # nums: the numeric feature columns, defined earlier
print(X.shape)
print(y.shape)
(219, 11)
(219,)
from sklearn.model_selection import cross_val_score
#---holds the cv (cross-validation) scores---
cv_scores = []
#---number of folds---
folds = 10
#---creating the list of candidate k values for KNN, up to the training-fold size---
ks = list(range(1, int(len(X) * ((folds - 1)/folds))))
#---remove all multiples of 5 as this is a 5-class problem and we want to avoid ties---
ks = [k for k in ks if k % 5 != 0]
#---perform k-fold cross validation---
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    #---performs cross-validation and returns the average accuracy---
    scores = cross_val_score(knn, X, y, cv=folds, scoring='accuracy')
    mean = scores.mean()
    cv_scores.append(mean)
    print(k, mean)
1 0.8123376623376621
2 0.79004329004329
3 0.7898268398268399
4 0.8175324675324676
6 0.8220779220779221
7 0.8220779220779221
8 0.8359307359307359
9 0.8177489177489177
11 0.7902597402597402
...
193 0.41558441558441556
194 0.4012987012987013
196 0.32835497835497834
#---calculate misclassification error for each k---
MSE = [1 - x for x in cv_scores]
#---determining best k (min. MSE)---
optimal_k = ks[MSE.index(min(MSE))]
print(f"The optimal number of neighbors is {optimal_k}")
#---plot misclassification error vs k---
plt.plot(ks, MSE)
plt.plot(optimal_k, MSE[ks.index(optimal_k)], 'r', marker='*', label='optimal k')  # look up the error by position in ks (multiples of 5 were removed, so k-1 is not a valid index)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error (MSE)')
plt.legend()
plt.show()
The optimal number of neighbors is 8
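For reference, when `cross_val_score` is given a classifier and an integer `cv`, scikit-learn uses stratified folds (class proportions preserved in each fold) by default; a minimal standalone sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X_syn = rng.normal(size=(100, 2))
y_syn = (X_syn[:, 0] > 0).astype(int)

# integer cv + classifier -> StratifiedKFold under the hood
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                         X_syn, y_syn, cv=5, scoring='accuracy')
print(scores.shape)  # (5,) -- one accuracy per fold
print(scores.mean())
```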
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(175, 11)
(175,)
(44, 11)
(44,)
k = optimal_k
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=8)
y_pred = knn_model.predict(X_test)
mat_test = confusion_matrix(y_test, y_pred)
print('confusion matrix = \n', mat_test)
confusion matrix =
 [[ 6  0  0  0  0]
 [ 0  6  0  1  1]
 [ 0  0 11  0  0]
 [ 0  1  0  8  1]
 [ 0  4  0  2  3]]
types = sorted(subsetsall['type'].unique())  # sorted, because sklearn orders string labels alphabetically in confusion_matrix and classification_report
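Note that when the labels are strings, scikit-learn's `confusion_matrix` and `classification_report` order the classes by sorted label, not by order of first appearance, so any tick labels or `target_names` passed alongside must follow alphabetical order. A minimal check:

```python
from sklearn.metrics import confusion_matrix

y_true = ['PO', 'SL', 'CR', 'PO']
y_hat  = ['PO', 'SL', 'SL', 'PO']

# rows/columns follow sorted(unique labels): ['CR', 'PO', 'SL']
cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[0 0 1]   <- the one CR sample, predicted as SL
#  [0 2 0]   <- both PO samples correct
#  [0 0 1]]  <- the one SL sample correct
```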
fig, ax2 = plt.subplots(1, 1, figsize=(4, 4))
cm = confusion_matrix(y_test, y_pred)
ax2 = sns.heatmap(cm, annot=True, square=True, xticklabels=types, yticklabels=types)
ax2.set_xlabel('Predicted Labels')
ax2.set_ylabel('Actual Labels')
Text(17.25, 0.5, 'Actual Labels')
print(classification_report(y_test, y_pred, target_names=types))
precision recall f1-score support
CR 1.00 1.00 1.00 6
LF 0.55 0.75 0.63 8
LP 1.00 1.00 1.00 11
PO 0.73 0.80 0.76 10
SL 0.60 0.33 0.43 9
accuracy 0.77 44
macro avg 0.77 0.78 0.76 44
weighted avg 0.77 0.77 0.76 44
The overall accuracy here is lower than when using only 2 features and 3 weld defect types with k=5 neighbors. This can be expected: the model is more complex and has more possibly useless or detrimental features to work with. (However, it is still a decent accuracy, and the k-fold cross validation gives us confidence that the model is not overfitting.)
We can use this model for predicting CR and LP weld defect types, as they have a perfect f1-score (at least in this particular random split of the data). Their separability possibly could have been anticipated much earlier from these types' densities in the initial pairplot.
More curated features (such as in our first KNN explored here), and perhaps fewer candidate types (by first filtering out the well-predicted types, for example), would be better for predicting the other types.
Mainly, this exploration of the data (first with a curated selection and then the full set) shows the importance of tuning the model with the best features to show the best type splits. It is not always best to use all features and all types; you can't just throw them all in and expect the best results. Multiple models should be made and used in conjunction with each other to get the best prediction results, both for this weld defect dataset and in general.
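As a sketch of the "multiple models in conjunction" idea (on synthetic data, not the weld dataset): a first KNN separates an easy, well-clustered class from the rest, and a second KNN, trained only on the remaining classes, handles the harder split.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# three synthetic classes: class 0 is well separated; classes 1 and 2 overlap somewhat
X0 = rng.normal([5.0, 5.0], 0.3, size=(30, 2))
X1 = rng.normal([0.0, 0.0], 0.5, size=(30, 2))
X2 = rng.normal([2.0, 0.0], 0.5, size=(30, 2))
X_all = np.vstack([X0, X1, X2])
y_all = np.repeat([0, 1, 2], 30)

# stage 1: "easy class" vs the rest
stage1 = KNeighborsClassifier(n_neighbors=5).fit(X_all, y_all == 0)
# stage 2: trained only on the hard classes
hard = y_all != 0
stage2 = KNeighborsClassifier(n_neighbors=5).fit(X_all[hard], y_all[hard])

def predict_two_stage(points):
    points = np.asarray(points, dtype=float)
    easy = stage1.predict(points).astype(bool)
    out = np.zeros(len(points), dtype=int)          # easy points -> class 0
    if (~easy).any():
        out[~easy] = stage2.predict(points[~easy])  # refine the rest
    return out

print(predict_two_stage([[5.0, 5.0], [0.0, 0.0], [2.0, 0.0]]))  # -> [0 1 2]
```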